CO 

O 



(N 



Adaptive Cluster Expansion (ACE): A 
Multilayer Network for Estimating Probability 
Density Functions* 



S P Luttrell 
December 17, 2010 

O . 

Abstract: We derive an adaptive hierarchical method of estimating high dimen- 
sional probability density functions. We call this method of density estimation 
\Q • the "adaptive cluster expansion", or ACE for short. We present an application of 

this approach, based on a multilayer topographic mapping network, that adap- 
tively estimates the joint probability density function of the pixel values of an 
PQ'! image, and presents this result as a "probability image". We apply this to the 

problem of identifying statistically anomalous regions in otherwise statistically 
homogeneous images. 



1 Introduction 



The purpose of this paper is to develop a novel type of adaptive network for es- 
• timating probability density functions (PDF) for use in Bayesian analysis [TJ [2] . 

We consider only techniques that scale well for use in high dimensional spaces, 
. such as the analysis of large arrays of pixels in image processing. There are 

many attempts to solve this type of density estimation problem. For instance, 
the Boltzmann machine [3] is essentially a trainable Gibbs distribution, which 
permits arbitrarily complicated statistical structure to be modelled via hidden 
variables. Unfortunately, this generality must be paid for by performing lengthy 
Monte Carlo simulations. There are various extensions to this technique, such as 
the higher order Boltzmann machine @], which capture higher order statistical 
' behaviour more economically, but none of these variations has been shown to 

be suitable for high-dimensional image processing problems. Using maximum 
entropy techniques [5], we develop a number of variations on the Gibbs distribu- 
tion approach [6 ! , and propose a scheme in which we replace simple interactions 
between a large number of hidden variables (as in the Boltzmann machine) by 
complicated interactions which directly model the statistical structure of the 
data; this is an extreme form of the approach taken in [4]. 

The novel adaptive density estimator that we develop in [5] is based on 
a multilayer network, in which we choose the layer-to-layer connections to be 
hierarchical, and the layer-to-layer transformations to be topographic mappings 
|7J; this adaptively transforms the input data into a multiscale "pyramid- like" 
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format. In [B] we further propose that the joint PDFs of adjacent nodes in each 
layer should be combined to form an estimate of the joint PDF of the nodes in 
the input layer. By analogy with the standard derivation of Gibbs distributions, 
we can also derive our joint PDF estimate by applying the maximum entropy 
method [5J. However, our result is computationally much cheaper to implement 
than a standard Gibbs distribution, because we do not need to perform Monte 
Carlo simulations in order to integrate over the states of hidden variables. We 
suggest the name "adaptive cluster expansion" (ACE) for this type of network 
estimate of high-dimensional joint PDFs. Other literature on this approach can 
be found in [5J [§J HB EJ UH , where we further develop multilayer topographic 
mapping networks, and their relationship to vector quantisers. 

The purpose of this paper is to present a complete account of ACE, and to 
demonstrate its effectiveness when applied to the problem of density estimation. 
We do not dwell on the details of how to implement the topographic mapping 
training algorithm (we review this in the appendix). In Section[5]we develop the 
ACE method of density estimation by appealing to simple counting arguments. 
In Section [3] we demonstrate the power of ACE by applying it to the problem 
of estimating the joint PDF of the pixels of textured images selected from the 
Brodatz album |13] , 

2 Probability Density Function Estimation 

In this section we present a derivation of the ACE estimate Q{x) of a PDF 
P{x). We develop this result by appealing to simple counting arguments and 
by using a diagrammatic language. 

2.1 Derivation of the ACE Estimate of a PDF 




X X X X 

12 3 4 



Figure 1: Basic 2-layer network. The input space (layer 0) is 4-dimensional and 
the output space (layer 1) is 2-dimensional. The layer 0-to-l transformation 
factorises into two independent transformations: y\ depends only on (x\,X2), 
and j/2 depends only on {x^,Xi). 

We show a simple network in FigureQ] where the input space is 4-dimensional, 
the output space is 2-dimensional, and we factorise the feedforward transforma- 
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tion as y(x) — (2/1(211, x 2 ), 2/2 (^3 > ^4))- Suppose we estimate the joint PDF 
Pout (j/ij 2/2) °f the outputs, and the joint PDFs P in ,i2(xi, x 2 ) and ^11,34(2:3,^4) 
of each pair of inputs, by measuring their histograms, for instance. Using this 
information alone we now wish to construct an estimate Q(x) of the true joint 
PDF P(x) of the 4-dimensional input. There are two alternative, but equivalent, 
ways of writing Q(x), each of which has its own interesting interpretation. 
Firstly, we may write 



r\f \ Tt t I \ I \\ -fin, 12(^1,^2) Pin^O^i X&) % 

Q{x) = P nt{yi{xi,x 2 ),y 2 (x 3 ,x 4 )) — — -— - — -— ^- 1 

Paut{yi{Xl,X 2 )) Pout (2/2 (,X 3 ,Xi)) 

In Equation [JJ we construct Q(x) as follows. We use P ut(j/i, 2/2) directly to 
estimate the joint PDF of the outputs, and indirectly to estimate the joint PDF 
of the inputs. In order to convert a PDF in output space (i.e. P ou t(2/lj 2/2)) into 
a PDF in input space we must divide P ou t(2/i, 2/2) by a compression factor equal 
to the number of input values that can produce the observed output value. 
Because we obtain y\ and y 2 separately from the pairs {x\,x 2 ) and (x 3 ,Xi), 
respectively, this compression factor is the product of two separate factors. For 
instance, the compression factor corresponding to 2/1 is the ratio ^p°' ^(xi 'xl)) > 
where (P; n , 12(2:1, x 2 )) is the average value of P n , 12 (2:1, 2:2) over all the (x%,x 2 ) 
that produce the same value of 2/1 • However, we may refine this compression 
factor by using p n , 12(2:1, 2:2) instead of (P n , 12(^1, x 2 )) in the denominator, to 
yield the ratio ^'^'j'^ ■ An analogous argument may be applied to obtain 
the compression factor corresponding to y 2 , and the results combined to obtain 
the final expression for Q(x) as shown in Equation [TJ 

nt \ v f \v 1 \ Pont{yi{xi,x 2 ),y 2 {x 3 ,x i )) 

Q{x) = P„,i 2 (xi,a: 2 )P n ,34(a;3,a;4) , , , , rr (2) 

Pout (2/1 (Xi,X 2 )) Pout (2/2 (X 3 , X4) ) 

which is trivially the same as Equation[TJ but we have arranged its terms in a new 
way. This furnishes us with an alternative interpretation of Q(x). Thus, imagine 
that we are provided only with p n , 12(2^1, x 2 ) and P n , 34(2:3, 2:4), and no informa- 
tion about the correlations between the pair (xi, x 2 ) and the pair (213, Xt±). This 
is sufficient for us to construct Q(x) as the product P; n! i2(2:i, x 2 ) P n , 34(2:3, X4). 
Now, we admit that in fact we also know P ou t(2/i, 2/2), which is a source of infor- 
mation about correlations between the pair (xi,x 2 ) and the pair (x 3 , X4). We 
make use of this information by forming the dimensionless ratio Po^yuyi) — 

J ° Pont (yi) Pout (V2) ' 

which differs from unity when yi and y 2 are correlated random variables (i.e. 
Pout (2/1, 2/2) i= Pout (2/1) Pout (2/2))- This ratio is greater (or less) than unity when 
the pair (j/i, y 2 ) is more (or less) likely to occur than would have been estimated 
from knowledge of the marginal PDFs P ou t(2/i) an d P ut(y 2 ) alone. Finally, we 
use this dimensionless ratio as a correction factor to obtain the expression for 
Q(x) shown in Equation[2] This derivation is heuristic, but it leads to the same 
result as shown in Equation [TJ 

In Figure[5]we present an alternative representation of the network in Figure 
[JJ in which we emphasise the PDFs that we use to construct Q(x). Thus we 
introduce a shorthand notation in which we use an oval to highlight each clique 
of nodes in the network. We define the word "clique" to mean "complete set of 
nodes having the same parent node". As is conventional when discussing tree- 
like networks, we regard the higher layers of the network as being the ancestors 
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Figure 2: The clique PDFs that we use in the basic 2-layer network. P ut(yi, 2/2) 
is the joint PDF of the pair of network outputs, and P ou t(yi) and Pout(j/2) are 
its two marginal PDFs. -Pout (sn, 3/2) — j g a di m ensionless ratio which records 
correlations between j/i and j/2- -Pin, 12(^1, ^2) and Pi n , 34(^3, £4) are the joint 
PDFs of the pairs of inputs from which y\ and 2/2 derive, respectively. 



of the lower layers, regardless of the fact that the direction of information flow 
is in the opposite direction through the tree. We then construct Q(x) as the 
product of the three clique PDFs shown, whilst ensuring that the clique in layer 
1 is appropriately normalised to render its contribution dimensionless. This 
leads to the form of Q(x) in Equation [2] 

This diagrammatic approach to constructing Q(x) may be readily extended 
to any tree-like feedforward network. We favour this approach, because the 
basic strategy for deriving Q{x) by invoking compression factors remains the 
same, but the burden of notational detail becomes somewhat heavy, so diagrams 
provide an ideal shortcut. For convenience, we summarise the prescription for 
constructing Q(x) from a tree-like diagram as follows: 

1. Estimate all of the clique PDFs, as histograms, for instance. 

2. Deduce all of the single- node marginal PDFs from the clique PDFs esti- 
mated in the previous step. For instance this would create P ou t(yi) and 
-Pout (2/2) from P out (yi, 2/2) ■ This step is not needed in in layer 0. 

3. From the results estimated in the previous two steps, for each clique com- 
pute a clique factor as follows: 

(a) In the input layer the factor is the clique PDF itself. 

(b) In other layers the factor is the clique PDF divided by the product 
of its marginal PDFs (e.g. pj^%g w) ). 

4. Finally, to construct Q(x), form the product of all of the clique factors 
estimated in the previous step. 



4 



2.2 Translation Invariance 



A disadvantage of the above prescription for constructing Q(x) is that it does 
not treat the components of x on an equal footing. For instance, in Equation 
[5] we see that the pair {x\,x-i) is treated differently from the pair (2:2,^3), even 
though both of these are pairs of adjacent components in the data. In order to 
solve this problem we construct a number of different tree-like networks, each 
of which breaks symmetry in its own peculiar way, and then we combine the 
results from each network to construct a composite Q(x) which respects the 
required symmetry. 
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Figure 3: An example of the 4 separate 2-layer networks that we need to combine 
in order to produce a Q(x) that treats each component of x on an equal footing. 
Figure 3a shows the basic 2-layer network, Figure 3b shows the same network 
with the layer 1 clique PDFs translated. Figure 3c and Figure 3d derive from 
Figure 3a and Figure 3b by simultaneously translating the clique PDFs in both 
network layers. 

In Figure [3] we show an example of the set of 4 different 2-layer networks which 
we need to combine in order to construct a composite Q(x). In this example we 
assume that the input is a high dimensional vector, so we can ignore edge effects. 
We replicate the basic network structure of Figure [5] across the input vector, as 
shown. Each of the 4 networks has its own set of clique PDFs (drawn as ovals 
in Figure [2]), each of which leads to its own estimate Q(x) which breaks sym- 
metry. However, a symmetric combination (such as the arithmetic or geometric 
mean) of these 4 results treats each component of x on an equal footing. We 
can verify this by noting that the set of cliques that contributes in the spatial 
neighbourhood of each component of x does not depend (apart from a trivial 
overall translation) on which component we select. 

We must select a prescription for forming the composite Q(x). It needs 
only to be a symmetric combination of the 4 individual estimates that we show 
in Figure [3J the arithmetic mean and geometric mean are obvious choices. On 
pragmatic grounds, we choose to use the geometric mean, because it corresponds 
to the arithmetic mean of logQ(a;), which is more convenient to perform in 
limited precision hardware (logQ(a;) has a much smaller dynamic range than 
Q(x), assuming that we avoid the logarithmic singularity). 
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Figure 4: Example of the composite network connectivity that we require in 
order for a single network to compute a composite Q(x), which treats each 
component of the input on an equal footing. This connectivity is the union of 
all of the binary trees can be generated from a reference binary tree (which we 
highlight in bold). 

In Figure [?] we show the connectivity of part of a 4-layer composite network 
that can be used to process the input data in preparation for constructing a 
composite Q(x). This connectivity contains all possible embedded tree-like 
networks, and in Figure [5] we highlight one such embedded tree for illustrative 
purposes. 

For an n-layer network, we form the composite Q(x) as the geometric mean 
over the Q{x) derived from all tree- like networks that are embedded in this 
composite network, to yield the geometric mean PDF Q gT[L (x) in the form 

n— 1 

log Q gm (x) = E^TiE lo g P k ( 3 ) 

L=0 k 

where L sums over layers to n — 1 of the network, k sums over cliques within 
a layer of the network, and P^ is the clique PDF at position k in layer L. It is 
important to note that the cliques are not simply adjacent nodes in each layer 
of the network. We must select pairs of nodes that form a "complete set of 
nodes having the same parent node". In layer this means that the nodes are 
adjacent. In layer 1 the nodes in a pair are separated by 1 intervening node. In 
layer 2 there are 3 intervening nodes, and so on. For L > 1 we must ensure that 
the P£ are dimensionless by dividing out the marginal PDFs, as in Equation 
H The 2^ factor ensures that we include each tree-like network exactly once, 
and that the final result is indeed the geometric mean of these contributions. 
Figure [3] shows the terms that Equation [3] generates when we set n = 2. 

There are two further assumptions that we could make in order to sim- 
plify our result even further. Firstly, we could assume that the layer-to-layer 
transformations in Figure 0] were independent of position k within each layer L. 
Secondly, we could assume that the clique PDFs were independent of position 
k within each layer L. We can make both of these assumptions if the statistical 
properties of the input data are known to be translationally invariant (such as 
might be the case for an image of a texture, for instance). In all of our numerical 
simulations we make these two simplifying assumptions. 

2.3 Modular Implementation 

We now describe a practical implemention of Equation [3] in the context of image 
processing (i.e. 2-dimensional arrays of pixels of data). There are three basic 
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operations to perform. We must use a training set to determine suitable layer- 
to-layer transformations, then estimate the clique PDFs in each network layer, 
and then construct logQ gm (x) from these estimates. Ideally we should opti- 
mise the layer-to-layer transformations directly so that the constructed Q gnl {x) 
is "close to" P{x) in some sense (e.g. relative entropy), but we have not yet 
found a computationally cheap way of doing this. Instead, we tackle the prob- 
lem indirectly, by using our existing multilayer topographic mapping network 
technique [10]. There are two main reasons for this choice. Firstly, this type of 
network is computationally cheap to train; we typically train such a network at 
the rate of 2.3 second per layer on a VAXstation 3100 workstation (assuming 
6 bit data values). Secondly, the network encodes the input in such a way as 
to be able to reconstruct it approximately from the state of any network layer. 
Although this second property does not in general imply that the encoded input 
is the optimal one for constructing an estimate of the input PDF, it turns out 
that it does produce useful results. 



iTiultilsyer 
topographic 
mapping 
subsystem 



PDF 

estimation 
subsystem 



—I— 




k 1 




output — 






image 






layer 


scale- If A 


layer i 



Figure 5: First two layers of a modular system for constructing Q gm (x). The 
top half of the diagram is a multilayer network subsystem (actually, a multi- 
layer topographic mapping in this case), which operates from left to right. The 
bottom half of the diagram is the PDF estimation subsystem, which operates 
from right to left. We connect the two systems by feeding logarithmic clique 
PDFs measured in the multilayer network through to be added together in the 
PDF estimator. 



In Figure [3] we show a system for constructing Q gm (x), which consists of 
two interconnected subsystems - a multilayer topographic mapping subsystem 
for transforming the input image, and a PDF estimation subsystem for forming 
an output image which contains the contributions to Q gm (x), each recorded 
in its spatially correct location in the image. For obvious reasons, we call the 
output image a "probability image". The flow from left to right across the top 
half of Figure [5] implements the network structure in Figure and the flow 
from right to left across the bottom half of Figure [5] progressively constructs the 
probability image. 

In Figure [S] the input image becomes layer of a multilayer network. In 
layer we extract a pair of adjacent pixels, and then pass it through a look-up 
table (or mapping) to yield a single value which we write into the appropriate 
pixel location in layer 1 (in Figure [1] this corresponds to transforming (xi,X2) 
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to become y\). We repeat this operation all over layer 0, to yield a whole array 
of transformed values in layer 1. There is an arbitrariness in our choice of the 
relative position of the pairs of pixels that we use (e.g. north-south, or east-west, 
etc). In our simulations we use a north-south relative position in the layer 0-to-l 
transformation, east-west in the layer l-to-2 transformation, and alternate these 
two choices thereafter as we progress from layer to layer of the network. Note 
also that the separation of the pairs of pixels is not the same in each layer. In 
Figure[5]the separation doubles as we progress from layer to layer, but in Figure 
[S]thc separation doubles after every two layers, because we must allow both the 
east-west and the north-south orientations to be processed at all separations 
(this is a consequence of processing 2-dimensional data through a 1-dimensional 
tree-structured network). If we concentrate only on the topology of the network 
that results from this prescription in Figure [SJ we discover that it is identical to 
the topology in Figure UJ Thus, the only difference between these two cases is 
the way in which we identify the pixels of the input data array with the layer 
nodes. 

We may use any transformation that we wish in the look-up table. We have 
not yet discovered a computationally cheap way of optimising the network in 
order to construct a Q(x) that best approximates the required P(x). Instead, we 
optimise the network in such a way that each layer could be used to reconstruct 
approximately the state of the previous layer. This is not the same optimisation 
problem, but it is computationally very cheap, and empirically it leads to useful 
results for Q(x). We choose to train the network as a multilayer topographic 
mapping, which we implement in a look-up table after the training schedule has 
ended. Typically, the largest number of bits per pixel that we use is 8, which 
corresponds to a look-up table with 65536 (= 2 2x8 ) separate addresses, each 
containing an 8 bit output value. 

When we have trained a sufficient number of layers, we may estimate the 
clique PDFs in each layer. We simply record these as histograms, without 
making any attempt to interpolate or smooth these estimates; later on we shall 
mention a number of caveats. This completes the left-to-right pass in the top 
half of Figure [5j 

In order to construct our geometric mean estimate Q gul (x) of P(x), we must 
combine the estimates of the clique PDFs. We may obtain the result in Equation 
[3] by appropriately scaling and summing the logarithms of the histograms (and 
their marginal histograms) in Figure [5j The method that we use depends on 
the following rearrangement of Equation [3] 



log<2 g m(a:) = 




in which we successively compute the contributions starting at network layer 
n— 1, and then work outwards towards layer 0. First of all we initialise all of the 
images in the PDF estimation subsystem to some constant value (say zero), and 
then commence at layer n — 1 (i.e. the righthandmost layer in Figure [S]). Using 
the notation of Figure [21 each clique in the multilayer topographic mapping sub- 
system contributes a term of the form log P out (yi , y 2 ) -log P ou t (yi ) -log P ou t (2/2 ) , 
which we add to the values stored in the two pixels that are located at the same 
clique position in the PDF estimation subsystem. In order to compensate for 



this double counting, and in order to account for the ~ factors that appear in 
Equation |U we scale the logarithmic value by a factor i (= | x i). We then 
progress layer by layer towards the left in Figure At each layer we generate 
its logarithmic contribution as above, but now we add to this the contribution 
from the layer on its right, as shown in Figure [5] and Equation 0] By cascading 
the results backwards from layer to layer of the network, we iteratively construct 
log<5 gm (a;) in the form shown in Equation @] Note that the layer cliques are 
slightly different, because they contribute terms of the form log Pi n . 12(^1, £2)- 

When all of these stages are complete, the output image in Figure [S] contains 
pixel values whose sum equals the required logQ gm (a;). The contribution to 
log<5 gm (a;) that is recorded in an output pixel derives from a (rectangular) 
region in the input image that surrounds the location of the output pixel, so the 
output image can be interpreted as an image of correctly spatially registered 
logarithmic probability contributions to logQ gm (a;). 

In our simulations we investigate how each individual layer of the multilayer 
network contributes to log Q gm (x), so we switch off all except one of the sources 
of logarithmic probability in Figure [5J which permits only a single layer of 
the network to contribute to the construction of the output image. Because 
each layer of the network typically is sensitive to statistical structure in the 
input image at only one length scale, the output image then typically reveals 
contributions to logQ gm (a;) at only one length scale. 

We should remark in passing that there are many other possible ways in 
which Figure [5J could be configured. Our results depend on an underlying tree- 
like structure, which we replicate to produce the translation invariant network 
in Figure 21 which we then use directly to produce the design in Figure [SJ In the 
case of a non-binary tree we must be careful to produce the correct generalisation 
of Figure [4] and Figure [5j but there are no new difficulties in principle. 

2.4 Algorithmic Details 

We compensate for some of the effects of non-uniform illumination of the scene 
in the input image by adding a grey scale wedge whose gradient we choose in 
such a way as to remove the linear component of the non-uniformity. This 
improves the assumed translation invariance of the image statistics. We do not 
attempt to perform a histogram equalisation on the input image, because the 
transformation from network layer to layer 1 tends to perform this function 
anyway. In order not to disrupt the discussion, we review the details of the 
topographic mapping training algorithm in the appendix. 

We choose to process the image in alternate directions using the following 
sequence: north/south, east/west, north/south, east/west, etc. This sequence 
leads to the following sequence of rectangular regions of the input image that 
influence the value in each pixel in each layer of the network: 1x2,2x2,2x4, 
4x4, etc, using (east /west, north/south) coordinates. In all of our experiments 
we use a 6-layer network, so the value in each pixel in the final layer is sensitive 
to an 8 x 8 region of the input image. 

The number of bits per pixel B that we use in each layer of the network 
determines the quality of the topographic mappings (the B bit output from a 
topographic mapping is the index of the winner from amongst 2 B competing 
"neurons") . Increasing B improves the quality of the mapping but increases the 
training time; we need to compromise between these two conflicting require- 
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ments. In our work on simple Brodatz texture images we find that choosing 
B to lie between 6 and 8 proves to be sufficient. Note that we choose to use 
the same number of bits per pixel in each layer of the network. In general this 
restriction is not necessary. 

It is important to note that for a given value of B there is an upper limit 
on the allowed entropy (per unit area) that the input data should have. A 
hierarchically connected multilayer topographic mapping network progressively 
squeezes the input data through an ever smaller bottleneck (in fact there are 
multiple parallel bottlenecks due to the overlapping tree structure) as we pass 
through the layers of the network. There is a upper limit to the number of 
network layers beyond which it simply cannot preserve information that is useful 
in estimating the joint density of the input data, which limits the capabilities 
of our current method. 

The choice of the size of the histogram bins is also important. A property 
of the multilayer topographic mapping network is that adjacent histogram bins 
derive from input vectors that are close to each other (in the Euclidean sense), 
so it makes sense to rebin the histogram by adding together the contents of 
adjacent bins. We may easily control the histogram bin size by truncating the 
low order bits of each pixel value. If we truncate b low order bits of each pixel 
value, then effectively we smooth the histogram over 2 b adjacent bins (for each 
dimension of the histogram). As we smooth the histogram it will suffer from 
less noise, but we run the danger of smoothing away significant structure that 
might usefully be used to characterise the statistics of the input image; so we 
need to make a compromise. 

It is most important not to use histogram bins that are too small. A large 
number of small histogram bins would record the details of the statistical fluctu- 
ations of the training image (as particular realisations of a Poisson noise process 
in each bin), and would act as a detailed record of the structure in the training 
image, and thus be unable to generalise very well. Such histograms would look 
very spiky, and in extreme cases there might be counts recorded in only a few 
bins, with zeros in all of the remaining bins. If this situation were to occur, then 
the training image would have a large Q gm (x), whereas a test image (having the 
same statistical properties) would have a small Q gm (x). The cause of this prob- 
lem is the absence of a significant overlap between the spikes in the training and 
test image histograms, which could be avoided by ensuring that the histogram 
bins are not too small. Generally, we find that a little experimentation can be 
used to determine a robust histogram binning strategy, so we do not attempt 
to implement a more sophisticated technique here. 

Finally, we display the contributions to logQ gm (a;) as follows. We determine 
the range of pixel values that occurs in the image, and we translate and scale 
this into the range [0, 255]. This ensures that the smallest logarithmic probabil- 
ity appears as black, and the largest logarithmic probability appears as white, 
and all other values are linearly scaled onto intermediate levels of grey. This 
prescription has its dangers because each image determines its own special scal- 
ing, so one should be careful when comparing two different images. It can also 
be adversely affected by pixel value outliers arising from Poisson noise effects, 
where an extreme value of a single pixel could affect the way in which the whole 
of an image is displayed. However, we find that the overlapping tree structure 
of our multilayer network causes enough averaging together of individual contri- 
butions Q(x) that the composite result Q gm (x) does not suffer from problems 
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due to pixel value outliers. 



3 Application to the Detection of Anomalies in 
Textures 

In this section we present the results of applying the system shown in Figure 
[5] to four 256 x 256 images of textures taken from the Brodatz texture album 
|13j . In all cases we compensate for uneven illumination by introducing a grey 
scale wedge as we explained earlier, we use 8 bits per pixel for the topographic 
mappings, we use 6 bits per pixel for histogramming, and we invert the [0, 255] 
scale to represent the contributions to logQ gm (a;) in such a way that white 
pixels indicate a small (rather than a large) contribution to logQ gm (a;). Thus 
white pixels in the output image correspond to regions of the input image whose 
statistical properties differ markedly from the statistics averaged over the whole 
image. We usually call this representation of the contributions to logQ gm (x) 
an "anomaly image". 

We do not present these results as necessarily being an efficient way of detect- 
ing texture anomalies. Rather, we merely apply our novel method of estimating 
PDFs, as expressed in Equation [3] and in Figure [SJ to the particular problem of 
texture analysis, because this is an effective way of demonstrating some of the 
more interesting properties of log<3 gm (a;). 

3.1 Texture 1 

In Figure |B] we show the first Brodatz texture image that we use in our experi- 
ments. The image is slightly unevenly illuminated and has a fairly low contrast, 
but nevertheless its statistical properties are almost translation invariant. 




Figure 6: 256x256 image of Brodatz image 1. 
In Figure [7] we show the anomaly images that derive from Figure [6] Note 
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how the anomaly images become smoother as we progress from Figure [TJa, to 
Figure [7f, due to the increasing amount of averaging that occurs amongst the 
overlapping trees in the network. 




Figure 7: 256x256 anomaly images of Brodatz image 1. 

Figure [JJ5 and Figure [7f reveal a highly localised anomaly in the original 
image. Figure [7f corresponds to a length scale of 8 x 8 pixels, which is the 
approximate size of the fault that is about j of the way down and slightly to 
the left of centre of Figure [51 The fault does not show up clearly on the other 
figures in Figure [7J because their characteristic length scales are either too short 
or too long to be sensitive to the fault. 

From Figure [JJ we conclude that ACE can easily pick out localised faults in 
highly ordered textures. 
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3.2 Texture 2 




Figure 8: 256x256 image of Brodatz image 2. 

In Figure [5] we show the second Brodatz texture image that we use in our 
experiments. The image has a high contrast and translation invariant statistical 
properties. 




Figure 9: 256x256 anomaly images of Brodatz image 2. 

In Figure[5]we show the anomaly images that derive from Figure[5J The most 
interesting anomaly image is Figure [5f which shows several localised anomalies. 
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About halfway down and to the left of centre of the image is an anomaly that 
corresponds to a dark spot on the thread in Figure |SJ The brightest of the 
anomalies in the cluster just above the centre of the image corresponds to what 
appears to be a slightly torn thread in Figure [5] The other anomalies in this 
cluster are weaker, and correspond to slight distortions of the threads. There is 
another anomaly just below and to the right of the centre of Figure |Hf. which 
corresponds to what appears to be another slightly torn thread in Figure [5] 
These anomalies all occur at, or around, a length scale of 8 x 8 pixels. Several 
of the anomaly images show an anomaly in the bottom left hand corner of the 
image, which corresponds to a small uniform patch of fabric in Figure [5] 

The results in Figure|9]corroborate the evidence in Figure[7]that we can train 
ACE to pick out localised anomalies in highly structured textures. This type of 
texture could be analysed much more simply by model-based techniques that 
took advantage of their near-periodicity. However, that does not detract from 
the fact that, by making use of its adaptability, ACE succeeds in modelling these 
textures without prior knowledge of their near-periodicity. We seek a general 
purpose approach to density estimation; not a toolkit of different (usually model- 
based) techniques, each tuned to its own type of problem. 

3.3 Texture 3 




Figure 10: 256x256 image of Brodatz image 3. 

In Figure [TO] we show the third Brodatz texture image that use in our experi- 
ments. The image has a very high contrast and statistical properties that are 
almost translation invariant. However the density of anomalies is much higher 
than in either Figure [5] or Figure [51 



14 



Figure 11: 256x256 anomaly images of Brodatz image 3. 



In Figure[TTJwe show the anomaly images that derive from Figure[TU] At the 
lower left hand corner of Figure ITTf there is a large anomaly that corresponds 
to a region of Figure fTOl that is distorted to the left. Figure [TTf is sensitive to 
a length scale of 8 x 8 pixels, so it does not respond to this leftward distortion 
(which occurs on a length scale of around 32 x 32 pixels), rather it responds to 
localised variations in the separations of the threads. 

There are numerous other anomalies in Figure [TQ1 some are detected in 
Figure [Til an d some are not. The ability of ACE to pick out anomalies degrades 
as the density of anomalies increases. This is because the anomalies themselves 
are part of the statistical properties that are extracted by ACE from the training 
image, and if a particular type of anomaly occurs often enough in the image 
then it is no longer deemed to be an anomaly. In extreme cases there is also 
the possibility that the entropy (per unit area) of the input image can saturate 
ACE and thus degrade its performance, as we discussed earlier. 

3.4 Texture 4 

In this section we present a slightly different type of experiment in which we train 
ACE on one image and test ACE on another image. To create the two images 
we start with a single 256 x 256 image of a Brodatz texture, which we divide 
into a left half and a right half. We then use the left half to construct a training 
image, and the right half to construct a test image. Note that in constructing 
these images we scrupulously avoid the possibility that the training and test 
images contain elements deriving from a common source, although there are 
some small residual correlations between the two images along their common 
edge. 
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Figure 12: 256x256 image of a Brodatz image of a carpet for training. 

In Figure [12] we show the training image which is a montage of two copies of 
the left hand half of a Brodatz texture image. Note that we use square, rather 
than rectangular, images because our software is restricted to processing this 
type of image. 




Figure 13: 256x256 image of a Brodatz image of a carpet for testing. 

In Figure \T3\ we show the test image which is a montage of two copies of 
the right hand half of a Brodatz texture image, and superimposed on that is a 
64 x 64 patch which we generated by flipping the rows and columns of a copy 
of the top left hand corner of this image. This patch is a hand-crafted anomaly. 
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Figure 14: 256x256 anomaly images of a Brodatz image of a carpet. 

In Figure [Tl] we show the anomaly images that derive from Figure Q2] after 
we train on Figure[T2J Figure [T4T shows the strongest response to the anomalous 
patch in the centre of the image, corresponding to anomaly detection on a length 
scale of 8 x 8 pixels. 

4 Conclusions 

We present a novel method of density estimation in high-dimensional spaces, 
such as images. In Bayesian data processing there is a pressing need for a flexible 
way of constructing such estimates, because the basic objects that we manipulate 
in Bayesian analysis are joint PDFs, which we must somehow construct in the 
first place. We call the hierarchical network structure that emerges from our 
analysis an "Adaptive Cluster Expansion", or ACE for short. 

ACE is computationally very cheap: we can train a multilayer topographic 
mapping network to estimate the joint PDF of its input data at the rate of 1 
network layer every 2.3 second (on a VAXstation 3100, and assuming 6 bits per 
pixel), where each layer analyses one length scale (power of 2) in the input data. 
We find, in our experiments with Brodatz textures, that 6 network layers allows 
the detection of statistical anomalies in the textures. This result is not universal, 
because it must depend strongly on the scale at which the anomalous statistical 
structure in the data is to be found. Although we demonstrate ACE only in 
a texture anomaly detection role, its scope is far greater than this. ACE is a 
general purpose, and computationally cheap, network for estimating densities 
in high-dimensional spaces. 

For completeness, we should mention that the performance of ACE in its 
current form has two fundamental limitations. Firstly, we assume that the net- 
work connectivity is fixed, and that its functionality is determined by a training 
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algorithm. This restricts the possible statistical properties of the input data 
that could be estimated. Secondly, ACE is based upon a hierarchically con- 
nected multilayer topographic mapping network, which progressively squeezes 
the input data through an ever smaller bottleneck as we pass through the layers 
of the network. There is a upper limit to the number of network layers beyond 
which ACE simply cannot preserve information that is useful in estimating the 
statistics of the input data. For instance, the statistics of an extremely noisy 
image of a texture can not be successfully estimated by ACE, because the noise 
entropy would saturate ACE before the statistics of the underlying texture could 
be investigated. This problem can be solved by introducing explicit noise models 
into ACE, which we shall report elsewhere. 

A Appendix 

The standard topographic mapping training procedure in [7] is a rather ineffi- 
cient algorithm. In |10| we present in detail an efficient training procedure for 
topographic mappings, and explain how to use it to train multilayer topographic 
mappings. 

For convenience, we introduce some notation. 
x = input vector 

y = index of the winning "neuron" 

x{y) = reference vector associated with y 

Tt(y' — y) = topographic neighbourhood function (normalised to unit total 
mass) 

e = update parameter used during training 
N = number of reference vectors 

A.l Standard Topographic Mapping Training Algorithm 

The standard topographic mapping training procedure is essentially as follows 

1. Select a training vector x at random from the training set. 

2. Map x to y by using a nearest neighbour prescription applied to the dis- 
tance of x from each of the current set of reference vectors. 

3. For all y', move the reference vector x(y') directly towards the input vector 
a; by a distance eir(y' — y) \\x — 

4. Go to step 1. 

Repeat this loop as often as is required to ensure convergence of the reference 
vectors. 

The standard training method specifies that 7r(j/' — y) should be an even 
unimodal function whose width should be gradually decreased as training pro- 
gresses. This allows coarse-grained organisation of the reference vectors to oc- 
cur, followed progressively by ever more fine-grained organisation, until finally 
the algorithm converges to an optimum set of reference vectors. In a similar 
vein, the relative size of the update step e should also be steadily decreased as 
training progresses. 
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A. 2 Modified Topographic Mapping Training Algorithm 



In our own modification |10| of the standard topographic mapping training 
we replace a shrinking 7r(y' — y) function acting on a fixed number of reference 
vectors, by a fixed Tr(y' — y) function acting on an increasing number of reference 
vectors. There are many minor variations on this theme, but we find that it is 
sufficient to define 



where we absorb the e into the definition of Tr(y' — y). We increase the number 
of reference vectors in a binary sequence (i.e. N = 2,4,8, 16,32, • • •), and we 
initialise each generation of reference vectors by interpolation from the previous 
generation. We find that the following parameter values yield adequate conver- 
gence: e — 0.1, e' = 0.05, and we perform 20iV training updates before doubling 
the value of N, as above. We initialise the N = 2 pair of reference vectors as a 
random pair of vectors chosen from the training set. 

In numerous experiments, we find that this modified form of the topographic 
mapping training algorithm converges much more rapidly than the standard 
method. Furthermore, the binary sequence of N values lends itself well to 
implementing the trained topographic mapping using a look-up table. 
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